Building Domain-Specific Taggers without Annotated (Domain) Data
Authors
Abstract
Part-of-speech (POS) tagging is a fundamental component of many NLP systems. When taggers developed in one domain are used in another, performance can degrade considerably. We present a method for developing taggers for new domains without requiring POS-annotated text in the new domain. Our method uses raw domain text and identifies related words to form a domain-specific lexicon. This lexicon provides the initial lexical probabilities for EM training of an HMM. We evaluate the method by applying it to the biology domain and show that it achieves results comparable with those of some taggers developed for this domain.
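The abstract does not say how the lexicon is built or how the HMM is configured, so the following is only a minimal sketch of the general idea: a lexicon that lists the permitted tags for each word seeds the emission (lexical) probabilities, and Baum-Welch EM then re-estimates the model from raw, untagged sentences. The tag set, the toy lexicon, the corpus, and every function name here are hypothetical illustrations, not the authors' implementation.

```python
from collections import defaultdict

def normalize(counts, keys):
    """Turn counts over `keys` into probabilities (uniform if all are zero)."""
    keys = list(keys)
    total = sum(counts.get(k, 0.0) for k in keys)
    if total == 0.0:
        return {k: 1.0 / len(keys) for k in keys}
    return {k: counts.get(k, 0.0) / total for k in keys}

def init_emissions(lexicon, tags, vocab):
    """Uniform P(word | tag) over the words whose lexicon entry allows the tag.
    Words missing from the lexicon may take any tag (an assumption)."""
    emit = {}
    for t in tags:
        allowed = {w: 1.0 for w in vocab if t in lexicon.get(w, tags)}
        emit[t] = normalize(allowed, allowed)
    return emit

def forward_backward(sent, tags, start, trans, emit):
    """Scaled forward-backward pass over one non-empty sentence."""
    alpha, scale = [], []
    for i, w in enumerate(sent):
        row = {}
        for t in tags:
            prev = start[t] if i == 0 else sum(alpha[i - 1][s] * trans[s][t] for s in tags)
            row[t] = prev * emit[t].get(w, 0.0)
        z = sum(row.values())
        alpha.append({t: row[t] / z for t in tags})
        scale.append(z)
    beta = [None] * len(sent)
    beta[-1] = {t: 1.0 for t in tags}
    for i in range(len(sent) - 2, -1, -1):
        nxt = sent[i + 1]
        beta[i] = {s: sum(trans[s][t] * emit[t].get(nxt, 0.0) * beta[i + 1][t]
                          for t in tags) / scale[i + 1] for s in tags}
    return alpha, beta, scale

def em_step(corpus, tags, start, trans, emit):
    """One Baum-Welch iteration: collect expected counts, then renormalize."""
    s_cnt = defaultdict(float)
    t_cnt = {s: defaultdict(float) for s in tags}
    e_cnt = {t: defaultdict(float) for t in tags}
    for sent in corpus:
        alpha, beta, scale = forward_backward(sent, tags, start, trans, emit)
        for i, w in enumerate(sent):
            for t in tags:
                gamma = alpha[i][t] * beta[i][t]  # posterior P(tag_i = t | sentence)
                e_cnt[t][w] += gamma
                if i == 0:
                    s_cnt[t] += gamma
                if i + 1 < len(sent):
                    for u in tags:
                        t_cnt[t][u] += (alpha[i][t] * trans[t][u]
                                        * emit[u].get(sent[i + 1], 0.0)
                                        * beta[i + 1][u] / scale[i + 1])
    return (normalize(s_cnt, tags),
            {s: normalize(t_cnt[s], tags) for s in tags},
            {t: normalize(e_cnt[t], e_cnt[t]) for t in tags})

def posterior_tags(sent, tags, start, trans, emit):
    """Pick the most probable tag for each word independently."""
    alpha, beta, _ = forward_backward(sent, tags, start, trans, emit)
    return [max(tags, key=lambda t: alpha[i][t] * beta[i][t]) for i in range(len(sent))]

# Hypothetical toy data: a tiny tag set, lexicon, and raw (untagged) corpus.
tags = ["DT", "NN", "VB"]
lexicon = {"the": {"DT"}, "protein": {"NN"}, "receptor": {"NN"},
           "binds": {"NN", "VB"}, "signals": {"NN", "VB"}}
corpus = [["the", "protein", "binds", "the", "receptor"],
          ["the", "receptor", "signals"]]
vocab = sorted({w for sent in corpus for w in sent})

start = normalize({}, tags)                    # uniform initial-tag distribution
trans = {s: normalize({}, tags) for s in tags} # uniform transitions
emit = init_emissions(lexicon, tags, vocab)    # lexicon-seeded emissions
for _ in range(10):
    start, trans, emit = em_step(corpus, tags, start, trans, emit)

tagged = posterior_tags(corpus[0], tags, start, trans, emit)
```

Because an emission probability that starts at zero stays at zero under EM, the lexicon acts as a hard constraint on which tags a word can receive; EM only redistributes probability among the tags the lexicon permits.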
Similar resources
Rapid Adaptation of POS Tagging for Domain Specific Uses
Part-of-speech (POS) tagging is a fundamental component for performing natural language tasks such as parsing, information extraction, and question answering. When POS taggers are trained in one domain and applied in significantly different domains, their performance can degrade dramatically. We present a methodology for rapid adaptation of POS taggers to new domains. Our technique is unsupervi...
Pointwise Prediction and Sequence-Based Reranking for Adaptable Part-of-Speech Tagging
This paper proposes an accurate method for part-of-speech (POS) tagging that is highly domain-adaptable. The method is based on an assumption that the POS transition tendencies do not depend on domains, and has the following three characteristics: 1) it is trainable from partially annotated data, 2) it uses efficiently trainable pointwise POS taggers to allow for active learning, and 3) is more ...
Methods Paper: Heuristic Sample Selection to Minimize Reference Standard Training Set for a Part-Of-Speech Tagger
Part-of-speech tagging represents an important first step for most medical natural language processing (NLP) systems. The majority of current statistically-based POS taggers are trained using a general English corpus. Consequently, these systems perform poorly on medical text. Annotated medical corpora are difficult to develop because of the time and labor required. We investigated a heuristic-...
POS Tagger Combinations on Hungarian Text
In this paper we will briefly survey the key results achieved so far in Hungarian POS tagging and show how classifier combination techniques can aid the POS taggers. Methods are evaluated on a manually annotated corpus containing 1.2 million words. POS tagger tests were performed on single-domain, multiple domain and cross-domain test settings, and, to improve the accuracy of the taggers, vario...
Fast Domain Adaptation for Part of Speech Tagging for Dialogues
Part of speech tagging accuracy deteriorates severely when a tagger is used out of domain. We investigate a fast method for domain adaptation, which provides additional in-domain training data from an unannotated data set by applying POS taggers with different biases to the unannotated data set and then choosing the set of sentences on which the taggers agree. We show that we improve the accura...
Publication date: 2007